NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Personalized Meta-Federated Learning for IoT-Enabled Health Monitoring

https://doi.org/10.1109/TCAD.2024.3388908

Jia, Zhenge; Zhou, Tianren; Yan, Zheyu; Hu, Jingtong; Shi, Yiyu (October 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
A Reconfigurable Bandpass Filter with Ferroelectric Devices for Intracardiac Electrograms Monitoring

https://doi.org/10.1109/MWSCAS60917.2024.10658827

Jia, Jianwei; Jia, Zhenge; Phadke, Omkar; Choe, Gihun; Shi, Yiyu; Yu, Shimeng (August 2024, IEEE)

Full Text Available
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

https://doi.org/10.1109/TCAD.2024.3443692

Dong, Peiyan; Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Li, Yanyu; Xu, Dongkuan; Huang, Heng; Hu, Jingtong; Jones, Alex K; Shi, Yiyu; et al (November 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

Dong, Peiyan; Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Li, Yanyu; Xu, Dongkuan; Huang, Heng; Hu, Jingtong; Jones, Alex K; Shi, Yiyu; et al (October 2024, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS)

While Vision Transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (< 1 ms) is challenging. Current computing platforms like CPUs, GPUs, or FPGA-based solutions struggle to meet this deterministic low-latency real-time requirement, even with quantized ViT models. Some approaches use pruning or sparsity to reduce model size and latency, but this often results in accuracy loss. To address the aforementioned constraints, in this work, we propose EQ-ViT, an end-to-end acceleration framework with novel algorithm and architecture co-design features to enable real-time ViT acceleration on AMD Versal Adaptive Compute Acceleration Platform (ACAP). The contributions are four-fold. First, we perform in-depth kernel- level performance profiling & analysis and explain the bottlenecks for existing acceleration solutions on GPU, FPGA, and ACAP. Second, on the hardware level, we introduce a new spatial and heterogeneous accelerator architecture, EQ-ViT architec- ture. This architecture leverages the heterogeneous features of ACAP, where both FPGA and artificial intelligence engines (AIEs) coexist on the same system-on-chip (SoC). Third, On the algorithm level, we create a comprehensive quantization-aware training strategy, EQ-ViT algorithm. This strategy concurrently quantizes both weights and activations into 8-bit integers, aiming to improve accuracy rather than compromise it during quanti- zation. Notably, the method also quantizes nonlinear functions for efficient hardware implementation. Fourth, we design EQ- ViT automation framework to implement the EQ-ViT architec- ture for four different ViT applications on the AMD Versal ACAP VCK190 board, achieving accuracy improvement with 2.4%, and average speedups of 315.0x, 3.39x, 3.38x, 14.92x, 59.5x, 13.1x over computing solutions of Intel Xeon 8375C vCPU, Nvidia A10G, A100, Jetson AGX Orin GPUs, and AMD ZCU102, U250 FPGAs. The energy efficiency gains are 62.2x, 15.33x, 12.82x, 13.31x, 13.5x, 21.9x.
more » « less
Full Text Available
Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis

Qin, Ruiyang; Xia, Jun; Jia, Zhenge; Jiang, Meng; Abbasi, Ahmed; Zhou, Peipei; Hu, Jingtong; Shi, Yiyu (June 2024, ACM/IEEE)

After a large language model (LLM) is deployed on edge devices, it is desirable for these devices to learn from user-generated conversation data to generate user-specific and personalized responses in real-time. However, user-generated data usually contains sensitive and private information, and uploading such data to the cloud for annotation is not preferred if not prohibited. While it is possible to obtain annotation locally by directly asking users to provide preferred responses, such annotations have to be sparse to not affect user experience. In addition, the storage of edge devices is usually too limited to enable large-scale fine-tuning with full user-generated data. It remains an open question how to enable on-device LLM personalization, considering sparse annotation and limited on-device storage. In this paper, we propose a novel framework to select and store the most representative data online in a self-supervised way. Such data has a small memory footprint and allows infrequent requests of user annotations for further fine-tuning. To enhance fine-tuning quality, multiple semantically similar pairs of question texts and expected responses are generated using the LLM. Our experiments show that the proposed framework achieves the best user-specific content-generating capability (accuracy) and fine-tuning speed (performance) compared with vanilla baselines. To the best of our knowledge, this is the very first on-device LLM personalization framework.
more » « less
Full Text Available
Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis

https://doi.org/10.1145/3649329.3655665

Qin, Ruiyang; Xia, Jun; Jia, Zhenge; Jiang, Meng; Abbasi, Ahmed; Zhou, Peipei; Hu, Jingtong; Shi, Yiyu (June 2024, ACM)

Full Text Available
SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators

https://doi.org/10.1109/ISVLSI61997.2024.00095

Ji, Shixin; Yang, Zhuoping; Chen, Xingzhen; Cahoon, Stephen; Hu, Jingtong; Shi, Yiyu; Jones, Alex K; Zhou, Peipei (July 2024, IEEE)

Embodied carbon has been widely reported as a significant component in the full system lifecycle of various computing systems green house gas emissions. Many efforts have been undertaken to quantify the elements that comprise this embodied carbon, from tools that evaluate semiconductor manufacturing to those that can quantify different elements of the computing system from commercial and academic sources. However, these tools cannot easily reproduce results reported by server vendors' product carbon reports and the accuracy can vary substantially due to various assumptions. Furthermore, attempts to determine green house gas contributions using bottom-up methodologies often do not agree with system-level studies and are hard to rectify. Nonetheless, given there is a need to consider all contributions to green house gas emissions in datacenters, we propose SCARIF, the Server Carbon including Accelerator Reporter with Intelligence-based Formulation tool. SCARIF has three main contributions: (1) We first collect reported carbon cost data from server vendors and design statistic models to predict the embodied carbon cost so that users can get the embodied carbon cost for their server configurations. (2) We provide embodied carbon cost if users configure servers with accelerators including GPUs, and FPGAs. (3) By using case studies, we show that certain design choices of data center management might flip by the insight and observation from using SCARIF. Thus, SCARIF provides an opportunity for large-scale datacenter and hyperscaler design. We release SCARIF as an open-source tool at https://github.com/arc-research-lab/SCARIF.
more » « less
Full Text Available
SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Huang, Heng; Jones, Alex; Hu, Jingtong; Shi, Yiyu; Zhou, Peipei (June 2024, The 32nd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2024))

Full Text Available
CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture

https://doi.org/10.1145/3686163

Zhuang, Jinming; Lau, Jason; Ye, Hanchen; Yang, Zhuoping; Ji, Shixin; Lo, Jack; Denolf, Kristof; Neuendorffer, Stephen; Jones, Alex; Hu, Jingtong; et al (August 2024, ACM Transactions on Reconfigurable Technology and Systems)

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises:How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch between massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to composemultiple diverse MM accelerator architecturesworking concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in FP32 data type, respectively, which obtain 5.29\(\times\), 32.51\(\times\), 1.00\(\times\), and 1.00\(\times\)throughput gains compared to one monolithic accelerator. CHARM achieves the maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively. We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this paper and to enable other users to learn and leverage CHARM framework and tools in their end-to-end systems:https://github.com/arc-research-lab/CHARM.
more » « less
Full Text Available
SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

https://doi.org/10.1145/3626202.3637569

Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Huang, Heng; Jones, Alex K.; Hu, Jingtong; Shi, Yiyu; Zhou, Peipei (April 2024, ACM/SIGDA International Symposium on Field Programmable Gate Arrays)

With the increase in the computation intensity of the chip, the mismatch between computation layer shapes and the available computation resource significantly limits the utilization of the chip. Driven by this observation, prior works discuss spatial accelerators or dataflow architecture to maximize the throughput. However, using spatial accelerators could potentially increase the execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launch one monolithic accelerator, and (2) spatially launch multiple accelerators. From the observations, we find that there is a latency throughput tradeoff between these two execution models, and combining these two strategies together can give us a more efficient latency throughput Pareto front. To achieve this, we propose spatial sequential architecture (SSR) and SSR design automation framework to explore both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia GPU A10G, 16nm AMD FPGAs ZCU102, and U250. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only solution and spatial-only solution on VCK190, our spatial-sequential-hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use SSR analytical models to demonstrate how to use SSR to optimize solutions on other computing platforms, e.g., 14nm Intel Stratix 10 NX.
more » « less

« Prev Next »

Search for: All records